I endeavour in this project out of my own volition. It is the first time that I come in contact with this data. I carry no bias before carrying this Exploratory Data Analysis, and the primary goal is to further my knowledge in the methods as well as an understanding of different areas through data analysis.
The full name of the data set is “Inpatient Prospective Payment System (IPPS) Provider Summary for the Top 100 Diagnosis-Related Groups (DRG) - FY2011”. It is own and updated by CMS.
The last update was on May 29, 2014, and the metadata was updated on August 30, 2017. For access to the data set as well as further information, please go to this link.
The Dataset has 12 columns. These are the names and the description of them:
The code and description identifying the MS-DRG. MS-DRGs are a
classification system that groups similar clinical conditions (diagnoses)
and the procedures furnished by the hospital during the stay.
The CMS Certification Number (CCN) assigned to the Medicare certified
hospital facility.
The name of the provider.
The provider’s street address.
The city where the provider is located.
The state where the provider is located.
The provider’s zip code.
The Hospital Referral Region (HRR) where the provider is located.
The number of discharges billed by the provider for inpatient
hospital services.
The provider's average charge for services covered by Medicare
for all discharges in the MS-DRG. These will vary from hospital to
hospital because of differences in hospital charge structures.
The average total payments to all providers for the MS-DRG including
the MS-DRG amount, teaching, disproportionate share, capital, and outlier
payments for all cases. Also included in average total payments are
co-payment and deductible amounts that the patient is responsible for and
any additional payments by third parties for coordination of benefits.
The average amount that Medicare pays to the provider for Medicare's share
of the MS-DRG. Average Medicare payment amounts include the MS-DRG amount,
teaching, disproportionate share, capital, and outlier payments for all
cases. Medicare payments DO NOT include beneficiary co-payments and
deductible amounts norany additional payments from third parties for
coordination of benefits. Note: In general, Medicare FFS claims with
dates-of-service or dates-of-discharge on or after April 1, 2013, incurred
a 2 percent reduction in Medicare payment. This is in response to mandatory
across-the-board reductions in Federal spending, also known as
sequestration. For additional information, <a href= "http://www.cms.gov/Outreach-and-Education/Outreach/FFSProvPartProg/Downloads/
2013-03-08-standalone.pdf">visit</a>.
The initial process of refining the dataset was done in a python script named “u_project_EDA-CMS_python3.ipynb”, the code is presented in “u_project_EDA-CMS_python3.html”; for further details examine the HTML file.
Firstly, we load the packages that we are going to use in this project.
Secondly, we load the dataset from its original location to the data frame df_cms.
## DRG.Definition Provider.Id
## 1 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10001
## 2 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10005
## 3 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10006
## 4 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10011
## 5 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10016
## 6 039 - EXTRACRANIAL PROCEDURES W/O CC/MCC 10023
## Provider.Name Provider.Street.Address
## 1 SOUTHEAST ALABAMA MEDICAL CENTER 1108 ROSS CLARK CIRCLE
## 2 MARSHALL MEDICAL CENTER SOUTH 2505 U S HIGHWAY 431 NORTH
## 3 ELIZA COFFEE MEMORIAL HOSPITAL 205 MARENGO STREET
## 4 ST VINCENT'S EAST 50 MEDICAL PARK EAST DRIVE
## 5 SHELBY BAPTIST MEDICAL CENTER 1000 FIRST STREET NORTH
## 6 BAPTIST MEDICAL CENTER SOUTH 2105 EAST SOUTH BOULEVARD
## Provider.City Provider.State Provider.Zip.Code
## 1 DOTHAN AL 36301
## 2 BOAZ AL 35957
## 3 FLORENCE AL 35631
## 4 BIRMINGHAM AL 35235
## 5 ALABASTER AL 35007
## 6 MONTGOMERY AL 36116
## Hospital.Referral.Region.Description Total.Discharges
## 1 AL - Dothan 91
## 2 AL - Birmingham 14
## 3 AL - Birmingham 24
## 4 AL - Birmingham 25
## 5 AL - Birmingham 18
## 6 AL - Montgomery 67
## Average.Covered.Charges Average.Total.Payments Average.Medicare.Payments
## 1 32963.07 5777.24 4763.73
## 2 15131.85 5787.57 4976.71
## 3 37560.37 5434.95 4453.79
## 4 13998.28 5417.56 4129.16
## 5 31633.27 5658.33 4851.44
## 6 16920.79 6653.80 5374.14
## Census.Region Census.Region.Division Federal.Region
## 1 SOUTH D6_EAST_SOUTH_CENTRAL REGION_IV
## 2 SOUTH D6_EAST_SOUTH_CENTRAL REGION_IV
## 3 SOUTH D6_EAST_SOUTH_CENTRAL REGION_IV
## 4 SOUTH D6_EAST_SOUTH_CENTRAL REGION_IV
## 5 SOUTH D6_EAST_SOUTH_CENTRAL REGION_IV
## 6 SOUTH D6_EAST_SOUTH_CENTRAL REGION_IV
## Economic.Analysis.Region
## 1 SOUTHEAST
## 2 SOUTHEAST
## 3 SOUTHEAST
## 4 SOUTHEAST
## 5 SOUTHEAST
## 6 SOUTHEAST
Before we continue, we check if there are any null values within the data frame df_cms.
## [1] DRG.Definition
## [2] Provider.Id
## [3] Provider.Name
## [4] Provider.Street.Address
## [5] Provider.City
## [6] Provider.State
## [7] Provider.Zip.Code
## [8] Hospital.Referral.Region.Description
## [9] Total.Discharges
## [10] Average.Covered.Charges
## [11] Average.Total.Payments
## [12] Average.Medicare.Payments
## [13] Census.Region
## [14] Census.Region.Division
## [15] Federal.Region
## [16] Economic.Analysis.Region
## <0 rows> (or 0-length row.names)
There are no null values in the data frame. Afterwards, we check the number of entries present in this dataset.
## [1] 163065 16
The number of entries is 163065. The number of entries is significant. Therefore, We proceed to make a quick summary of the data.
## DRG.Definition
## 194 - SIMPLE PNEUMONIA & PLEURISY W CC : 3023
## 690 - KIDNEY & URINARY TRACT INFECTIONS W/O MCC : 2989
## 292 - HEART FAILURE & SHOCK W CC : 2953
## 392 - ESOPHAGITIS, GASTROENT & MISC DIGEST DISORDERS W/O MCC : 2950
## 641 - MISC DISORDERS OF NUTRITION,METABOLISM,FLUIDS/ELECTROLYTES W/O MCC: 2899
## 871 - SEPTICEMIA OR SEVERE SEPSIS W/O MV 96+ HOURS W MCC : 2812
## (Other) :145439
## Provider.Id Provider.Name
## Min. : 10001 GOOD SAMARITAN HOSPITAL : 633
## 1st Qu.:110092 ST JOSEPH MEDICAL CENTER : 427
## Median :250007 MERCY MEDICAL CENTER : 357
## Mean :255570 MERCY HOSPITAL : 347
## 3rd Qu.:380075 ST JOSEPH HOSPITAL : 343
## Max. :670077 ST FRANCIS MEDICAL CENTER: 277
## (Other) :160681
## Provider.Street.Address Provider.City
## 100 MEDICAL CENTER DRIVE: 183 CHICAGO : 1505
## 800 WASHINGTON STREET : 166 BALTIMORE : 1059
## 1 MEDICAL CENTER DRIVE : 142 HOUSTON : 950
## 100 HOSPITAL DRIVE : 106 PHILADELPHIA: 898
## 100 MICHIGAN ST NE : 100 BROOKLYN : 877
## 1000 BLYTHE BLVD : 100 SPRINGFIELD : 807
## (Other) :162268 (Other) :156969
## Provider.State Provider.Zip.Code Hospital.Referral.Region.Description
## CA : 13064 Min. : 1040 CA - Los Angeles : 3653
## TX : 11864 1st Qu.:27261 MA - Boston : 2910
## FL : 11155 Median :44309 GA - Atlanta : 2630
## NY : 9178 Mean :47938 TX - Houston : 2577
## IL : 7909 3rd Qu.:72901 PA - Philadelphia: 2554
## PA : 7804 Max. :99835 TX - Dallas : 2427
## (Other):102091 (Other) :146314
## Total.Discharges Average.Covered.Charges Average.Total.Payments
## Min. : 11.00 Min. : 2459 Min. : 2673
## 1st Qu.: 17.00 1st Qu.: 15947 1st Qu.: 5234
## Median : 27.00 Median : 25246 Median : 7214
## Mean : 42.78 Mean : 36134 Mean : 9707
## 3rd Qu.: 49.00 3rd Qu.: 43233 3rd Qu.: 11286
## Max. :3383.00 Max. :929119 Max. :156158
##
## Average.Medicare.Payments Census.Region
## Min. : 1149 MIDWEST :39087
## 1st Qu.: 4192 NORTHEAST:30230
## Median : 6158 SOUTH :67038
## Mean : 8494 WEST :26710
## 3rd Qu.: 10057
## Max. :154621
##
## Census.Region.Division Federal.Region
## D5_SOUTH_ATLANTIC :34118 REGION_IV :37453
## D3_EAST_NORTH_CENTRAL:27434 REGION_V :29686
## D2_MID-ATLANTIC :21808 REGION_VI :20395
## D7_WEST_SOUTH_CENTRAL:19478 REGION_III:17911
## D9_PACIFIC :17814 REGION_IX :17559
## D6_EAST_SOUTH_CENTRAL:13442 REGION_II :14004
## (Other) :28971 (Other) :26057
## Economic.Analysis.Region
## SOUTHEAST :48468
## GREAT LAKES:27434
## MIDEAST :25994
## FAR WEST :19016
## SOUTHWEST :18152
## PLAINS :11653
## (Other) :12348
Due to the vast array of data and groups found in this particular data frame. We will begin our analysis per regions. We will start with “Census Region” since it encompasses all states within the four central regions of the USA, the divisions within regions, the Federal bank that is in charge of the state as well as the Regions established by the Bureau of Economic Analysis.
This graph contains the number of entry per each of the groups mentioned above.
We will start the initial analysis by obtaining the frequency for each one of the variables of the columns:
## Census.Region freq
## 1 MIDWEST 39087
## 2 NORTHEAST 30230
## 3 SOUTH 67038
## 4 WEST 26710
## Census.Region.Division freq
## 1 D1_NEW_ENGLAND 8422
## 2 D2_MID-ATLANTIC 21808
## 3 D3_EAST_NORTH_CENTRAL 27434
## 4 D4_WEST_NORTH_CENTRAL 11653
## 5 D5_SOUTH_ATLANTIC 34118
## 6 D6_EAST_SOUTH_CENTRAL 13442
## 7 D7_WEST_SOUTH_CENTRAL 19478
## 8 D8_MOUNTAIN 8896
## 9 D9_PACIFIC 17814
## Federal.Region freq
## 1 REGION_I 8422
## 2 REGION_II 14004
## 3 REGION_III 17911
## 4 REGION_IV 37453
## 5 REGION_IX 17559
## 6 REGION_V 29686
## 7 REGION_VI 20395
## 8 REGION_VII 8493
## 9 REGION_VIII 4283
## 10 REGION_X 4859
## Economic.Analysis.Region freq
## 1 FAR WEST 19016
## 2 GREAT LAKES 27434
## 3 MIDEAST 25994
## 4 NEW ENGLAND 8422
## 5 PLAINS 11653
## 6 ROCKY MOUNTAIN 3926
## 7 SOUTHEAST 48468
## 8 SOUTHWEST 18152
In the Census Region, the region South has the highest account of entries with 67038 and the lowest in the West with 26710. In the Divisions in the Census Region, the Division 5 South Atlantic has the hights entries with 34116, and the lowest in the Division 1 New England.
When considering Regions under the different Federal Banks, Region IV is the one with the highest entry and the region with the lowest entries is Region_XIII with 4283. Lastly, The Rocky Mountain Region of the Bureau of Economic Analysis is the region with the lowest entries in 3926. On the other hand, Southeast is the region with the highest entries with 48468.
Before we continue, we will get the summary of the Total Discharges, Average Covered Charges, Average Total Payments and Average Medicare Payments per Census Region.
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 17.00 28.00 43.87 50.00 1487.00
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 18.00 29.00 45.83 53.00 3383.00
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 17.00 27.00 43.81 50.00 1344.00
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 11.00 15.00 23.00 35.13 39.00 1696.00
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3611 14792 21413 29395 34654 353774
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3481 14964 25870 37123 45428 613927
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2459 15416 23890 33361 40136 480540
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4202 23310 37111 51836 61668 929119
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2673 5062 6897 9228 10837 131187
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2765 5610 7824 10515 12138 140255
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2683 4878 6686 8941 10460 99307
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2995 6134 8525 11419 13295 156158
## df_cms$Census.Region: MIDWEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1327 4016 5850 7991 9613 130467
## --------------------------------------------------------
## df_cms$Census.Region: NORTHEAST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1680 4597 6818 9366 11009 133177
## --------------------------------------------------------
## df_cms$Census.Region: SOUTH
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1149 3849 5640 7746 9303 95701
## --------------------------------------------------------
## df_cms$Census.Region: WEST
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1604 5034 7429 10125 11900 154621
In order to continue this analysis, we will proceed to see the relationship between Total Discharges, Averaged Covered Charges (USD), Average Total Payments (USD) and Average Medicare Payments (USD).
From the graph, we can see that the relationship between Total Discharges and Average Covered Charges is that the fewer discharges the more Average Covered Charges. This trend is the same when we compare Total Discharges is plotted alongside Average Total Payments and Average Medicare Payments.
When a correlation test is performed between Total Discharges and the other variables these are the results:
The results indicate that there is no correlation between the number of discharges and the payments made or received by patients or Medicare.
The graph also shows the relationship between Average Covered Charges(USD) and Average Total Payments(USD) and Average Medicare Payments(USD). The graph shows almost a more linear positive relationship between Average Covered Charges(USD) and Average Total Payments(USD) and Average Medicare Payments (USD).
The graph is supported by the data extracted in the correlation test between the variables:
Corr: 0.764
The results indicate that there is a positive relationship between the variable and due to the similarity in results it further supports the similarity between the two relationships.
The last relationship that could be explored from the graph is between Average Total Payments(USD) and Average Medicare Payments(USD). The graph shows a very linear relationship between the two variables, and this is further supported by the correlation test showing a Corr value of 0.992.We will start by analysing the following group of variables together:
The first graph that is going to show the whole data from df_cms. The whole data is graphed in this way in order to acquire the whole pattern of the data between the two variables.
As seen on the graph, the majority of the data is under the x-value 250000$ and y-value 50000$. Due to this the second graph will have the limits mentioned above and also 4 further graphs will be added; the entire data frame data selectively gather under the four census regions of the USA, Midwest, Northeast, South and West.
As we can see from the graph West and Midwest regions, have a similar scatter plot, flatter and more left-leaning. However, Midwest has a more linear shape than the West region.
Northeast and South regions have similar scatterplots with widening x and y profile than other regions. However, the South region have a straight line. It will be interesting to know why it was developed.
We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.
##
## Pearson's product-moment correlation
##
## data: df_cms$Average.Covered.Charges and df_cms$Average.Medicare.Payments
## t = 485.66, df = 163060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7669354 0.7709033
## sample estimates:
## cor
## 0.7689268
The test shows a cor value of 0.767. Values between 0.7 and 0.9 state that there is a steady uphill linear relationship between Average Covered Charges and Average Medicare payments.
For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data we will explore other analysis between these two variables
This first graph is going to deliver an overall picture of the relation between the two data columns; including data patterns.
This graph shows that the majority of total discharges are fewer than 250 with an increase average total payments are mostly under 25000$.
The next graph will plot the same graph as above but with limits of x-value 500 and y-value 50000$, as well as four separate graphs with the data frame gather selectively from the regions of the USA.
All graphs share a similar pattern than the one shared from the pattern mention in the general and global graph. There are two distinct protrusions of data reflecting an increase of total discharges with approximately an average total payments of about 10000$; with a constant increase of discharges, similar to a horizontal line. The pattern is repeated about 15000$ Average Total Payments; it is less apparent in the West Region.
We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.
##
## Pearson's product-moment correlation
##
## data: df_cms$Total.Discharges and df_cms$Average.Total.Payments
## t = -6.4699, df = 163060, p-value = 9.833e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.02087213 -0.01116732
## sample estimates:
## cor
## -0.0160201
The test shows a cor value of -0.016. Values between 0 and -0.3 state that there is a weak negative linear relationship between Total Discharges and Average Total Payments.
It is unusual that this the correlation found between these variables; especially since these two variables involve the total of discharges and average total payments.
For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data, we will explore other analysis between these two variables.
This first graph is going to deliver an overall picture of the relation between the two data columns; including data patterns.
This graph shows a linear-like pattern. The majority of the entries are within the liner pattern. Also, the majority of entries are within x-value and y-value limits of 50000$.
All graphs show the same pattern. There are mild differences in the width. Further analysis could reveal more about this pattern.
We start the analysis by determining the Correlation between the two variables. We will perform a Pearson test between the two variables without subdividing the data into regions.
##
## Pearson's product-moment correlation
##
## data: df_cms$Average.Total.Payments and df_cms$Average.Medicare.Payments
## t = 2746.3, df = 163060, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.9892588 0.9894642
## sample estimates:
## cor
## 0.989362
The test shows a cor value of 0.989. Values between 0.7 and 0.9 state that there is a steady uphill linear relationship between Total Payments and Average Medicare Payments.
For the moment we will only explore the correlation between the two variables, and in a subsequent analysis with more detail and in-depth data, we will explore other analysis between these two variables.
In the region of the Northeast of the USA, we see that the Interquartile Range is between ~18 for the 25% percentile and ~67 for the 75% percentile.
In the region of the Midwest of the USA, we see that the Interquartile Range is between ~18 for the 25% percentile and ~67 for the 75% percentile.
In the region of the South of the USA, we see that the Interquartile Range is between ~18 for the 25% percentile and ~50 for the 75% percentile.
In the region of the West of the USA, we see that the Interquartile Range is between ~15 for the 25% percentile and ~42 for the 75% percentile.
This section only handles aesthetic modification of these images. The three images chosen for this section are:
The Dataset is one that shows a significant amount of data towards Medicare payment information in hospitals throughout the USA. It consists of 16 variables and 163065 entries. We explore the various relationships between total payments, Medicare payments, Total Discharges, and regions within the USA.
The dataset is vast and much more analysis could be performed, for example, the differences in payment from Medicare and Total payments made per procedure per hospitals in the same region as well as through states or regions in the USA.
The vastness of this dataset it is on itself a struggle, the analysis made in here it is mainly superficial and many more aspects could be analysed further. However, this analysis does give an overview picture of the distribution of payments and discharges stored and how that is represented in the different regions of the USA.
Furthermore, another difficulty obtain was to allocate the data into regions, divisions and so on. No python package is readily available that provides the information required, therefore it needed to be done by hand and checked in a multitude of sources, so the information was accurate.
One question that could further be asked form this dataset is how the different medical procedures are between payments(All the different types of payment related variables in this dataset) and discharges among states, regions and divisions. It is understood that different hospitals have a different payment structure or charge structure per procedure, so a further analysis could show this pattern and how much more significant such a difference is.